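Transparent objects are widely used in industrial automation and daily life. However, robust visual recognition and perception of transparent objects has long been a major challenge. Because of the refraction and reflection of light, most commercial-grade depth cameras are still poor at sensing the surfaces of transparent objects. In this work, we present a transformer-based depth estimation method for transparent objects that takes a single RGB-D input. We observe that the global characteristics of the transformer make it easier to extract contextual information for estimating depth in transparent regions. In addition, to better enhance fine-grained features, a feature fusion module (FFM) is designed to assist coherent prediction. Our empirical evidence shows that, compared with previous state-of-the-art convolution-based methods, our model delivers significant improvements on recent popular datasets, e.g., a 25% gain on RMSE and a 21% gain on RER. Extensive results show that our transformer-based model better aggregates the object's RGB information and inaccurate depth information to obtain a better depth representation. Our code and pre-trained model will be available at https://github.com/yuchendoudou/tode.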
translated by 谷歌翻译
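We introduce LAVIS, an open-source deep learning library for language-vision research and applications. LAVIS aims to serve as a one-stop comprehensive library that makes recent advances in the language-vision field accessible to researchers and practitioners, and that fosters future research and development. It features a unified interface for easy access to state-of-the-art image-language and video-language models and common datasets. LAVIS supports training, evaluation, and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue, and pre-training. At the same time, the library is highly extensible and configurable, facilitating future development and customization. In this technical report, we describe the design principles, key components, and functionalities of the library, and present benchmarking results across common language-vision tasks. The library is available at: https://github.com/salesforce/lavis.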
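In this work, we propose a novel method for the detailed reconstruction of transparent objects by exploiting polarimetric cues. Most existing methods lack sufficient constraints and suffer from over-smoothing. We therefore introduce polarization information as a complementary cue. We implicitly represent the object's geometry as a neural network, while a polarization renderer renders the object's polarization images from a given shape and illumination configuration. Because of transmission through the transparent object, directly comparing the rendered polarization images with images captured in the real world introduces additional errors. To address this issue, we introduce the concept of reflection percentage, which represents the proportion of the reflection component. The reflection percentage is computed by a ray tracer and then used to weight the polarization loss. We build a polarization dataset for multi-view transparent shape reconstruction to verify our method. Experimental results show that our method can recover detailed shapes and improve the reconstruction quality of transparent objects. Our dataset and code will be publicly available at https://github.com/shaomq2187/transpir.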
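This work studies the task of glossification, whose aim is to transcribe natural spoken-language sentences into ordered sign language glosses for the Deaf (hard-of-hearing) community. Previous sequence-to-sequence language models trained on paired sentence-gloss data often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions. We observe that, despite their different grammars, glosses effectively simplify sentences to ease deaf communication while sharing a large portion of their vocabulary with those sentences. This motivates us to implement glossification by executing a collection of editing actions, e.g., word addition, deletion, and copying, called editing programs, on the natural spoken-language counterparts. Specifically, we design a new neural agent that learns to synthesize and execute editing programs, conditioned on sentence contexts and partial editing results. The agent is trained to imitate minimal editing programs while exploring the program space more widely via policy gradients to optimize sequence-wise transcription quality. Results show that our approach outperforms previous glossification models.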
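The idea of executing an editing program on a sentence can be illustrated with a small sketch. The abstract does not specify how actions are encoded, so the `COPY`/`DEL`/`ADD` tuples, the function name `execute_program`, and the example sentence below are all hypothetical, and the neural agent that would synthesize the program is not shown.

```python
from typing import List, Tuple

# Hypothetical action encoding (an assumption, not taken from the paper):
#   ("COPY",)    emit the current source word and advance the cursor
#   ("DEL",)     skip the current source word and advance the cursor
#   ("ADD", w)   emit the new word w without consuming a source word
Action = Tuple[str, ...]


def execute_program(sentence: List[str], program: List[Action]) -> List[str]:
    """Apply a sequence of editing actions to a tokenized sentence."""
    glosses: List[str] = []
    pos = 0  # cursor into the source sentence
    for action in program:
        op = action[0]
        if op == "ADD":
            glosses.append(action[1])
        elif op in ("COPY", "DEL"):
            if pos >= len(sentence):
                raise IndexError("program consumes more words than the sentence has")
            if op == "COPY":
                glosses.append(sentence[pos])
            pos += 1
        else:
            raise ValueError(f"unknown action: {op}")
    return glosses


# Illustrative input only; real glosses follow sign language conventions.
sentence = ["i", "am", "going", "to", "school"]
program = [("COPY",), ("DEL",), ("DEL",), ("ADD", "go"), ("DEL",), ("COPY",)]
result = execute_program(sentence, program)
```

Because most gloss words are shared with the sentence, a program of this kind is dominated by copy and delete actions, which is what makes an edit-based formulation attractive compared with generating every gloss from scratch.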
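Because quality degradation occurs at every stage of visual signal acquisition, compression, transmission, and display, image quality assessment (IQA) plays an important role in image-based applications. Depending on whether a complete reference image is available, IQA can be divided into three categories: full-reference (FR), reduced-reference (RR), and no-reference (NR). This paper reviews state-of-the-art image quality assessment algorithms.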
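The abstract names no specific algorithm. As a minimal sketch of what the full-reference (FR) category means, the following computes PSNR, a classic FR measure chosen here purely for illustration; the function name `psnr` and the nested-list grayscale image representation are assumptions.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (assumed layout)


def psnr(reference: Image, distorted: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a distorted image."""
    if len(reference) != len(distorted) or any(
        len(r) != len(d) for r, d in zip(reference, distorted)
    ):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in reference)
    squared_error = sum(
        (r - d) ** 2
        for ref_row, dist_row in zip(reference, distorted)
        for r, d in zip(ref_row, dist_row)
    )
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


black = [[0.0, 0.0], [0.0, 0.0]]
white = [[255.0, 255.0], [255.0, 255.0]]
```

A full-reference measure like this needs the pristine image as an argument; reduced-reference methods would receive only some features of it, and no-reference methods would take the distorted image alone.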
We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at https://yale-lily.github.io/spider.
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLMs and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. To provide such prompts, we further employ LLM-agnostic models to generate prompts that describe image content, along with self-constructed question-answer pairs, which can effectively guide the LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Without the need for end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.
Zero-shot relation triplet extraction (ZeroRTE) aims to extract relation triplets from unstructured texts under the zero-shot setting, where the relation sets at the training and testing stages are disjoint. The previous state-of-the-art method handles this challenging task by leveraging pretrained language models to generate data as additional training samples, which increases the training cost and severely constrains the model performance. To address these issues, we propose a novel method named PCRED for ZeroRTE with Potential Candidate Relation Selection and Entity Boundary Detection. The remarkable characteristic of PCRED is that it does not rely on additional data and still achieves promising performance. The model adopts a relation-first paradigm, recognizing unseen relations through candidate relation selection. With this approach, the semantics of relations are naturally infused in the context. Entities are subsequently extracted based on the context and the semantics of relations. We evaluate our model on two ZeroRTE datasets. The experimental results show that our method consistently outperforms previous works. Our code will be available at https://anonymous.4open.science/r/PCRED.
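Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder and do not fully address the misalignment between unimodal video and text features. In addition, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video and text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and video QA, outperforming prior work by a substantial margin. Our code and pre-trained models will be released.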
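The abstract does not give the exact form of the VTC loss. As a minimal sketch, assuming the symmetric InfoNCE form that is common for instance-level contrastive alignment, the loss can be computed from a matrix of video-text similarities in which entry `sim[i][j]` compares video `i` with text `j` and matched pairs lie on the diagonal. The function names and the temperature value are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def _diagonal_cross_entropy(sim: Matrix, temperature: float) -> float:
    """Mean of -log softmax(sim[i] / temperature)[i] over all rows i."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[i]
    return total / len(sim)


def contrastive_loss(sim: Matrix, temperature: float = 0.07) -> float:
    """Symmetric video-to-text and text-to-video contrastive loss (assumed form)."""
    video_to_text = _diagonal_cross_entropy(sim, temperature)
    transposed = [list(col) for col in zip(*sim)]
    text_to_video = _diagonal_cross_entropy(transposed, temperature)
    return 0.5 * (video_to_text + text_to_video)


aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest
```

Minimizing a loss of this shape pulls each video toward its paired text and pushes it away from the other texts in the batch, which is the instance-level alignment the abstract describes.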
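Human pose transfer aims to transfer the appearance of a source person to a target pose. Existing methods that use flow-based warping for non-rigid human image generation have achieved great success. However, they fail to preserve appearance details in the synthesized images because the spatial correlation between the source and the target is not fully exploited. To this end, we propose the Flow-based Dual Attention GAN (FDA-GAN), which applies occlusion-aware and deformation-aware feature fusion for higher generation quality. Specifically, deformable local attention and flow similarity attention, which together constitute the dual attention mechanism, derive the output features responsible for deformation-aware and occlusion-aware fusion, respectively. In addition, to maintain pose and global position consistency during transfer, we design a pose normalization network that learns adaptive normalization from the target pose to the source person. Both qualitative and quantitative results show that our method outperforms state-of-the-art models on the public iPER and DeepFashion datasets.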